FAULT.PRS (from PCCDEMO.ZIP, Libris Britannia 4 CD-ROM) | 1993-12-20
                        FAULT TOLERANT SYSTEMS
                            Feature Article

                           By Adam Donnison
    [ fault-tolerance is available on the desktop NOW, but at a price. ]

  Before any discussion of Fault Tolerant systems can begin we must
define the term. Unfortunately there seem to be many varied
interpretations of what fault tolerant actually means, so let's at
least start with a baseline premise. Fault Tolerant systems must
tolerate faults. This may seem self-evident, but we need to examine
what this really means. My definition of fault tolerant is somewhat
more encompassing:
  A fault tolerant system must be capable of continuing operation
even if a major portion of the system becomes inoperable or suffers
damage.

  Note that I haven't said that the system should be immune to
faults, but that it must be capable of continuing near-normal
operation in the face of faults which would cripple a lesser system.
To that end a number of vendors have spent a lot of time and money
developing systems which try to meet these goals.

  Many of the fault tolerant systems that we see today have their
roots in the mainframe and minicomputer areas. In these areas price
was a lesser concern than performance or reliability. As such, many
diverse systems were implemented to handle the inevitable component
breakdown and be able to recover or to carry on regardless. With the
increasing insistence on price, performance and reliability in office
systems, some of the mainframe techniques have found favour in a
modified form for the PC architecture machines.

  Chief amongst these techniques is that of disk redundancy. A system
of redundant disks called RAID (Redundant Array of Inexpensive Disks)
is leading the way in cost-effective
data redundancy. RAID was originally defined as a system of 5 levels
of redundancy by Patterson, Gibson and Katz in 1987. The RAID
Implementation Level defines the type of system and covers a range of
techniques.

  RAID Level 1 is basic disk mirroring. By this we mean that disks
are arranged in pairs, with data being written to both disks at the
same time. When data is read from a disk and a disk error results,
the disk is reported as having an error and the data is taken from
the "good" disk. This system requires 2 drives of equal capacity to
give you a single effective drive. Obviously cost becomes a major
factor in this implementation, as the disk capacity needed is double
that normally required.

  RAID 2 stores ECC (Error Correcting Code) on redundant drives. As
most drives already store ECC data at the end of each sector, this
level is hardly ever used.

  RAID 3 is a high-performance system that uses multiple drives,
"stripes" the information for each block across those drives, and
uses a parity block on a redundant drive. The block is then read from
all drives simultaneously and the parity checked against the parity
stripe. This system offers increased cost-effectiveness over RAID 1
in that only one extra
drive is required in a system. This means that the more drives you
use, the less the extra cost involved. Most RAID 3 systems use
between 3 and 8 drives, requiring between 50% and 15% extra cost for
the parity drive. In order to achieve good throughput the drive
spindles should be synchronised, otherwise a block read may require
waiting for each drive in turn to spin to the required point. As most
drives do not allow motor synchronising, this system is relatively
uncommon.

  RAID 4 is similar to RAID 3 but removes the synchronisation
problems by requiring a disk block to be restricted to a single
drive. There is still a separate parity drive, so overlapped reads
cannot be achieved, as each read requires a read of the parity drive.
This limitation limits the usefulness of RAID 4, and it is seldom
used.

  RAID 5 eliminates the problems with RAID 4 in that parity
information is stored in a round-robin fashion amongst all drives in
the array. It has the advantage of RAID 3 in that the extra cost is
limited, but suffers in performance when compared with RAID 1. Still,
RAID 5 is probably the most cost-effective system in redundant disk
technology.
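  The parity mechanism behind RAID 3, 4 and 5 can be illustrated with
a small sketch. (This is plain Python with made-up block contents,
purely for illustration; real implementations work at the controller
or driver level, not in application code.)

```python
# Sketch of RAID-style parity: the parity block is the XOR of the
# data blocks, so any single missing block can be rebuilt by XORing
# the surviving blocks with the parity block.

def parity(blocks):
    """XOR a list of equal-sized blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

# One stripe written across three data drives (hypothetical contents).
stripe = [b"AAAA", b"BBBB", b"CCCC"]
p = parity(stripe)

# Simulate the second drive failing: rebuild its block from the
# survivors plus the parity block.
rebuilt = parity([stripe[0], stripe[2], p])
assert rebuilt == stripe[1]

# RAID 5 differs from RAID 3/4 only in placement: instead of a
# dedicated parity drive, the parity block rotates round-robin
# across the drives, one stripe at a time.
def parity_drive(stripe_number, drive_count):
    return stripe_number % drive_count

print([parity_drive(s, 4) for s in range(6)])  # -> [0, 1, 2, 3, 0, 1]
```

  Because the parity rotates, RAID 5 reads need not all touch the
same drive, which is what removes the RAID 4 bottleneck described
above.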
  Does this mean that fault tolerance is just RAID? Well, not really.
RAID is simply a disk management technology. If your power supply
disappears, you can forget about RAID doing thing one about the
situation. Fault tolerance must also address issues of power supply,
CPU redundancy, etc., etc. So let's have a look at what's available
there.

Power supply.

  In the main, the best form of defence in this area is to use a UPS.
Most UPS suppliers now provide low-cost "Intelligent UPS" units.
These are technically not UPS (Uninterruptible Power Supplies) but
rather a sophisticated method of ensuring a clean transition from the
powered-on to the powered-off state and back. They are really just
glorified short-term battery backups with enough intelligence to let
the computer know that it has a short time to close all its files and
do an orderly shutdown before power will be removed. When the power
comes back on, the system will wait until it has recharged its
batteries enough for another power-down before starting the computer
up again.

  This obviously is not ideal in a "non-stop" environment, but for
most of us it does what is really
needed, and that is the guaranteed orderly shutdown and restart. As
more and more systems start using virtual memory, more and more file
information is transient and subject to problems at power-down. This
is the next point we can look at.

File Systems.

  There is a lot of noise being made about the "new" technologies of
NT versus UnixWare and how these (and other) systems are improving
the lot of multi-user, high-power workstation users. It just turns
out that both of these offerings (and SCO, for that matter) offer the
Veritas file system, either as standard or as an optional feature.
This file system is very much like that used in the mainframe arena.
It uses a transaction logging mechanism that allows the file system
to be checkpointed, so that information which is waiting to be
written to disk when the power goes down is actually logged as
transactions to be completed. When the system returns to power, the
file system is returned to its last checkpoint and the pending
transactions are completed.

  This is not the only "safe" filesystem technology around, but it
must have something going for it. This does not preclude the use of
a UPS, but it does make for a much safer system.

CPU.

  The only other major point that can fail is the CPU. This is being
addressed by a number of vendors supplying symmetrical
multi-processing systems. In these, all CPUs are capable of running
all jobs (i.e. there is no master/slave relationship) and so, should
any CPU become unavailable, its jobs can be evenly distributed across
the remaining processors. Obviously this is an expensive alternative
but, depending on your environment, it may well be worth the cost.

Other issues.

  Obviously there are other issues that contribute to fault
tolerance. Some of these are security related and so should properly
be discussed elsewhere, whilst others relate to such things as
network access to multiple machines for users. This really comes
under the title of network management and needs another few pages to
discuss.

  Basically, what we have seen is that fault-tolerance is available
on the desktop NOW, but at a price. The level of fault tolerance you
want is controlled mainly by your wallet and your environment. ■
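  The transaction-logging file system described above can be sketched
in a few lines. (This is a toy model in Python; the names and
structure are invented for illustration and are not the actual
Veritas implementation.)

```python
# Toy sketch of checkpoint-plus-log recovery: writes are appended to
# a log before being applied, so after a power failure the file
# system state is rebuilt from the last checkpoint plus the logged
# pending transactions.

checkpoint = {"readme.txt": "v1"}     # state known safe on disk
log = [
    ("write", "readme.txt", "v2"),    # logged before power loss
    ("write", "notes.txt", "draft"),  # logged just before power loss
]

def replay(checkpoint, log):
    """Return to the checkpoint, then complete the pending transactions."""
    state = dict(checkpoint)
    for op, name, data in log:
        if op == "write":
            state[name] = data
    return state

recovered = replay(checkpoint, log)
print(recovered)  # -> {'readme.txt': 'v2', 'notes.txt': 'draft'}
```

  The point is that recovery never depends on the in-flight write
that was interrupted: either a transaction made it into the log and
is completed, or it is as if it never happened.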